import pandas as pd
import sqlite3
Part 1: Constructing a SQL Database
In this blog, we’ll be using SQL and plotly
to construct aesthetic graphs. The sqlite3
package allows for creating and accessing SQL databases in Python.
First, some basic imports:
You should already be fairly familiar with working with Pandas dataframes. To work with SQL, first we need to establish a connection and create a cursor that can execute queries for us.
= sqlite3.connect("data.db")
conn = conn.cursor() cursor
For this post we’ll be using temperature data collected from various temperature reading stations around the world. The to_sql()
function imports a given dataframe into our SQL database.
def to_db(url,name):
'''
Reads given .csv file into the established SQL database with given name
'''
= pd.read_csv(url)
df =False,if_exists="replace")
df.to_sql(name,conn,index
"temps_stacked.csv","temperatures")
to_db("countries.csv","countries")
to_db("station-metadata.csv","stations") to_db(
Now we can check that we have three different tables in our database:
"SELECT name FROM sqlite_master WHERE type='table'")
cursor.execute( cursor.fetchall()
[('temperatures',), ('countries',), ('stations',)]
Part 2: Extracting Data
Our end goal is to create a graph that addresses the following question:
How does the average yearly change in temperature vary within a given country?
To this end, we need a function to extract the specific data that we want from our database. The key here is knowing how to construct the appropriate SQL command. A tutorial on basic SQL syntax can be found here: https://www.w3schools.com/sql/default.asp
For our function, we want to be able to specify a year range as well as a month, the latter so that temperature variations as seasons change don’t affect the data.
def query_climate_database(country,year_begin,year_end,month):
'''
Returns Pandas dataframe of data from SQL database
where station is located in country and reading is
from month and in range year_begin to year_end
'''
#extract FIPS 10-4 from country
f"SELECT [FIPS 10-4] FROM countries WHERE Name='{country}' LIMIT 1")
cursor.execute(= cursor.fetchone()[0]
fips
#query SQL database for data
= f"SELECT S.name,S.latitude,S.longitude,C.Name,T.temp,T.year,T.month \
cmd FROM temperatures T \
LEFT JOIN stations S \
ON T.id = S.id \
LEFT JOIN countries C \
ON C.[FIPS 10-4] = SUBSTRING(T.id,1,2) \
WHERE (year >= {year_begin}) \
AND (year <= {year_end}) \
AND (month == {month}) \
AND (SUBSTRING(T.id,1,2) == '{fips}')"
return pd.read_sql_query(cmd,conn)
Here’s an example of what our returned dataframe looks like:
=query_climate_database(country = "India",
df= 1980,
year_begin = 2020,
year_end = 1)
month df
NAME | LATITUDE | LONGITUDE | Name | Temp | Year | Month | |
---|---|---|---|---|---|---|---|
0 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 23.48 | 1980 | 1 |
1 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 24.57 | 1981 | 1 |
2 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 24.19 | 1982 | 1 |
3 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 23.51 | 1983 | 1 |
4 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 24.81 | 1984 | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
3147 | DARJEELING | 27.050 | 88.270 | India | 5.10 | 1983 | 1 |
3148 | DARJEELING | 27.050 | 88.270 | India | 6.90 | 1986 | 1 |
3149 | DARJEELING | 27.050 | 88.270 | India | 8.10 | 1994 | 1 |
3150 | DARJEELING | 27.050 | 88.270 | India | 5.60 | 1995 | 1 |
3151 | DARJEELING | 27.050 | 88.270 | India | 5.70 | 1997 | 1 |
3152 rows × 7 columns
Part 3: Plotting
Now we’re ready to tackle our question. To calculate the average yearly change in temperature, we’ll be creating a linear regression of temperature vs. year and extracting the linear coefficient. For this to be meaningful, our function will be taking in a min_obs
parameter for the minimum amount of observations required for a station to be considered. If a station doesn’t have enough observations, we remove it from the dataframe before plotting.
Before we implement this in a function, let’s work through it with our India data. We can use the groupby()
function along with the transform()
function which, in contrast to apply()
, retains the indexing of the original dataframe.
#removes observations from stations with less than 10 observations
"NAME")["Year"].transform(lambda x:x.count())>=10] df[df.groupby(
NAME | LATITUDE | LONGITUDE | Name | Temp | Year | Month | |
---|---|---|---|---|---|---|---|
0 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 23.48 | 1980 | 1 |
1 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 24.57 | 1981 | 1 |
2 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 24.19 | 1982 | 1 |
3 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 23.51 | 1983 | 1 |
4 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 24.81 | 1984 | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
3140 | SHILONG | 25.600 | 91.890 | India | 10.40 | 1986 | 1 |
3141 | SHILONG | 25.600 | 91.890 | India | 11.20 | 1990 | 1 |
3142 | SHILONG | 25.600 | 91.890 | India | 11.99 | 2010 | 1 |
3143 | SHILONG | 25.600 | 91.890 | India | 9.93 | 2011 | 1 |
3144 | SHILONG | 25.600 | 91.890 | India | 9.68 | 2012 | 1 |
3106 rows × 7 columns
Notice that we’ve removed data from the DARJEELING station, which only had 7 observations.
Next, we import from sklearn
, which you should already be familiar with, and write a function that finds the average yearly temperature increase using a linear regression model. We can then use this to add a column to the dataframe.
from sklearn.linear_model import LinearRegression
def lin_coeff(data):
'''
Returns linear coefficient of year vs. temp regression fitted to data
'''
= LinearRegression().fit(data[["Year"]],data["Temp"])
reg return round(reg.coef_[0],4)
=df.groupby(["NAME","LATITUDE","LONGITUDE"]).apply(lin_coeff).reset_index()
data data
NAME | LATITUDE | LONGITUDE | 0 | |
---|---|---|---|---|
0 | AGARTALA | 23.8830 | 91.2500 | -0.0062 |
1 | AGRA | 27.1667 | 78.0333 | -0.0954 |
2 | AHMADABAD | 23.0670 | 72.6330 | 0.0067 |
3 | AKOLA | 20.7000 | 77.0330 | -0.0018 |
4 | AKOLA | 20.7000 | 77.0670 | -0.0059 |
... | ... | ... | ... | ... |
101 | TRIVANDRUM | 8.5000 | 77.0000 | 0.0229 |
102 | UDAIPUR_DABOK | 24.6170 | 73.8830 | 0.0724 |
103 | VARANASI_BABATPUR | 25.4500 | 82.8670 | -0.0130 |
104 | VERAVAL | 20.9000 | 70.3670 | 0.0248 |
105 | VISHAKHAPATNAM | 17.7170 | 83.2330 | -0.0340 |
106 rows × 4 columns
Things are looking pretty good! It’s not ideal that our column title is just “0”, but at this point we’re ready to write our plotting function using what we’ve just figured out. We’ll be relying on the plotly.express
module to do so.
from plotly import express as px
import plotly.io as pio
= 'iframe' pio.renderers.default
= {1 : "January",
month_dict 2 : "February",
3 : "March",
4 : "April",
5 : "May",
6 : "June",
7 : "July",
8 : "August",
9 : "September",
10 : "October",
11 : "November",
12 : "December"}
def temperature_coefficient_plot(country,year_begin,year_end,month,min_obs,**kwargs):
'''
Plots average yearly increase in temperature at stations in given country
Parameters:
country - country from which station data is taken
year_begin - earliest year from which data is taken
year_end - latest year from which data is taken
month - month from which data is taken (prevents seasonal changes in temperature
from influencing data)
min_obs - minimum number of observations required per station
'''
= query_climate_database(country,year_begin,year_end,month)
df
#remove station data without enough observations
= df[df.groupby("NAME")["Year"].transform(lambda x:x.count()) >= min_obs]
df
#calculate average yearly increase at each station
= df.groupby(["NAME","LATITUDE","LONGITUDE"]).apply(lin_coeff).reset_index()
df = df.rename(columns={0:"Estimated Yearly Increase (\u00b0C)"})
df
#creates scatterplot with color corresponding to change in temperature
= px.scatter_mapbox(df,
fig = "LATITUDE",
lat = "LONGITUDE",
lon = "NAME",
hover_name = "Estimated Yearly Increase (\u00b0C)",
color = 0,
color_continuous_midpoint = f"Average yearly increase in temperature in {month_dict[month]}<br>for stations in {country} from {year_begin} - {year_end}",
title **kwargs)
return fig
Here are two examples of what the graph looks like:
#set color scale for points
= px.colors.diverging.RdGy_r
color_map
= temperature_coefficient_plot("India", 1980, 2020, 1,
fig = 10,
min_obs = 2,
zoom ="carto-positron",
mapbox_style=color_map)
color_continuous_scale fig.show()
= temperature_coefficient_plot("China", 1975, 2015, 8,
fig = 10,
min_obs = 2,
zoom ="carto-positron",
mapbox_style=color_map)
color_continuous_scale fig.show()
Part 4: More Fun Stuff
Now that we’re more familiar with SQL and plotly
, let’s ask some more questions we can use our tools to visualize. For example,
How does longitude affect how greatly temperature varies across time?
The process is more or less the same as before: we first define a function that uses a SQL query to return the data that we want, and then we write a plotter function that uses plotly.express
.
def query_lon_temp(lat_min,lat_max):
'''
Returns dataframe containing longitudinal and temperature data within given latitude range
'''
= f"SELECT S.longitude,T.temp,T.year,T.month \
cmd FROM temperatures T \
LEFT JOIN stations S \
ON T.id = S.id \
WHERE (S.latitude >= {lat_min}) AND (S.latitude <= {lat_max})"
return pd.read_sql_query(cmd,conn)
def scatter_temp_lon_all(lat_min,lat_max):
'''
Plots standard deviation of temperature data at each longitude for stations within
given latitude range
'''
= query_lon_temp(lat_min,lat_max)
df
#calculate standard deviation of temperature data at each longitude
= df.groupby("LONGITUDE").apply(lambda x:x["Temp"].std()).reset_index()
df = df.rename(columns={0:"Temp StdDev (\u00b0C)"})
df
= px.scatter(data_frame = df,
fig = "LONGITUDE",
x = "Temp StdDev (\u00b0C)",
y = f"Temperature Variation vs. Longitude at Latitudes {lat_min}\u00b0 to {lat_max}\u00b0",
title = 0.5)
opacity return fig
-10,10) scatter_temp_lon_all(
We see the most variation around the tropics, in the middle of each hemisphere, while there is very little variation at the poles. While temperature at the equator varies somewhat, how much it varies stays relatively consistent.
A simpler question we can ask is:
What is the distribution of temperatures at each station?
Obviously there are too many stations to all be reasonably visualized, so our functions take in a list of stations to be plotted.
def query_temps(stations,month):
'''
Returns dataframe with data read from given stations at a particular month
'''
= f"SELECT S.name,T.temp,T.year \
cmd FROM temperatures T \
LEFT JOIN stations S \
ON T.id = S.id \
WHERE (S.name IN {tuple(stations)}) AND (T.month == {month})"
return pd.read_sql_query(cmd,conn)
def hist_temps(stations, month):
'''
Plots histogram of temperature distribution with stations as facets
'''
= query_temps(stations, month)
df = df.rename(columns={"Temp":"Temperature Readings (\u00b0C)","NAME":"Station"})
df
= px.histogram(df,
fig = "Temperature Readings (\u00b0C)",
x = 30,
nbins = "Station",
facet_row = f"Distribution of Temperature Readings (\u00b0C) in {month_dict[month]}")
title
return fig
Here’s what it looks like:
= ["DUBAI_INTL","HERAT","SAVE"]
stations 3) hist_temps(stations,
We’re pretty much done, but one last thing: don’t forget to close your connection to the database!
conn.close()